Impacts on Miami Housing Prices

603
Project
Author

Quinn He

Published

October 19, 2022

Code
library(tidyverse)
library(lmtest)
library(sf)
library(mapview)
library(GGally)
library(stargazer)


knitr::opts_chunk$set(echo = TRUE)

Data Read-in

Code
miami_housing <- read_csv("~/Documents/DACSS Program/data/miami-housing.csv")
Rows: 13932 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): LATITUDE, LONGITUDE, PARCELNO, SALE_PRC, LND_SQFOOT, TOT_LVG_AREA,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Introduction and Descriptive Statistics

Housing prices are always difficult to predict and can fluctuate randomly due to various variables or economic events. At the core, location is an extremely important factor in dictating the price of houses, and always will be. The purpose of this project is not to break new ground in the study of house prices, but to implement multiple regression techniques in determining how important home location is for single-family houses in Miami. I am particularly interested in Miami due to its proximity to the ocean and because Florida is a flat, low elevation state. With the threat of climate change and the increase in hurricanes every year, I want to examine Miami’s housing market in without climate variables included. For a future study, I would be interested to see if my discoveries hold up with increase in sea level and hurricane variables implemented.

Previous studies have pointed to outside factors like unemployment rate, mortgage rates, stocks, etc. as determinants of house prices, but for this study location is of primary concern.

The price of houses in Miami are influenced by their location, which includes the distance from various desirable locations.

The dataset chosen for the regression analysis contains information on 13,932 single-family homes sold in Miami.

Below are the names of the columns and what each one represents:

PARCELNO: unique identifier for each property. About 1% appear multiple times. SALE_PRC: sale price (\() LND_SQFOOT: land area (square feet) TOTLVGAREA: floor area (square feet) SPECFEATVAL: value of special features (e.g., swimming pools) (\)) RAIL_DIST: distance to the nearest rail line (an indicator of noise) (feet) OCEAN_DIST: distance to the ocean (feet) WATER_DIST: distance to the nearest body of water (feet) CNTR_DIST: distance to the Miami central business district (feet) SUBCNTR_DI: distance to the nearest subcenter (feet) HWY_DIST: distance to the nearest highway (an indicator of noise) (feet) age: age of the structure avno60plus: dummy variable for airplane noise exceeding an acceptable level structure_quality: quality of the structure month_sold: sale month in 2016 (1 = jan) LATITUDE LONGITUDE

I change a few of the variable names to make calling into functions easier because I do not always want to refer to all caps lettering.

Code
miami_housing <- miami_housing %>% 
  rename("latitude" = "LATITUDE",
         "longitude" = "LONGITUDE",
         "sale_price" = "SALE_PRC",  
         "land_sqfoot" = "LND_SQFOOT",  
         "floor_sqfoot" = "TOT_LVG_AREA",
         "special_features" = "SPEC_FEAT_VAL",  
         "dist_2_nearest_water" = "WATER_DIST",  
         "dist_2_biz_center" = "CNTR_DIST",  
         "dis_2_nearest_subcenter"= "SUBCNTR_DI", 
         "home_age" = "age") 

Exploratory Analysis

Before any model fitting and analysis, it would be beneficial to get an overall view of the data I am dealing with. With the summary() function I can get a large table that lets me look at all the different summary statistics of each variable. First off, the special features variable is denoted in price of a certain feature to the home, be it a swimming pool, solar power, or a hot tub. Control variables will help determine how important the distance to ocean variable is on the sale price.

Code
summary(miami_housing)
    latitude       longitude         PARCELNO           sale_price     
 Min.   :25.43   Min.   :-80.54   Min.   :1.020e+11   Min.   :  72000  
 1st Qu.:25.62   1st Qu.:-80.40   1st Qu.:1.079e+12   1st Qu.: 235000  
 Median :25.73   Median :-80.34   Median :3.040e+12   Median : 310000  
 Mean   :25.73   Mean   :-80.33   Mean   :2.356e+12   Mean   : 399942  
 3rd Qu.:25.85   3rd Qu.:-80.26   3rd Qu.:3.060e+12   3rd Qu.: 428000  
 Max.   :25.97   Max.   :-80.12   Max.   :3.660e+12   Max.   :2650000  
  land_sqfoot     floor_sqfoot  special_features   RAIL_DIST      
 Min.   : 1248   Min.   : 854   Min.   :     0   Min.   :   10.5  
 1st Qu.: 5400   1st Qu.:1470   1st Qu.:   810   1st Qu.: 3299.4  
 Median : 7500   Median :1878   Median :  2766   Median : 7106.3  
 Mean   : 8621   Mean   :2058   Mean   :  9562   Mean   : 8348.5  
 3rd Qu.: 9126   3rd Qu.:2471   3rd Qu.: 12352   3rd Qu.:12102.6  
 Max.   :57064   Max.   :6287   Max.   :175020   Max.   :29621.5  
   OCEAN_DIST      dist_2_nearest_water dist_2_biz_center
 Min.   :  236.1   Min.   :    0        Min.   :  3826   
 1st Qu.:18079.3   1st Qu.: 2676        1st Qu.: 42823   
 Median :28541.8   Median : 6923        Median : 65852   
 Mean   :31691.0   Mean   :11960        Mean   : 68490   
 3rd Qu.:44310.7   3rd Qu.:19200        3rd Qu.: 89358   
 Max.   :75744.9   Max.   :50400        Max.   :159976   
 dis_2_nearest_subcenter    HWY_DIST          home_age       avno60plus     
 Min.   :  1463          Min.   :   90.2   Min.   : 0.00   Min.   :0.00000  
 1st Qu.: 23996          1st Qu.: 2998.1   1st Qu.:14.00   1st Qu.:0.00000  
 Median : 41110          Median : 6159.8   Median :26.00   Median :0.00000  
 Mean   : 41115          Mean   : 7723.8   Mean   :30.67   Mean   :0.01493  
 3rd Qu.: 53949          3rd Qu.:10854.2   3rd Qu.:46.00   3rd Qu.:0.00000  
 Max.   :110554          Max.   :48167.3   Max.   :96.00   Max.   :1.00000  
   month_sold     structure_quality
 Min.   : 1.000   Min.   :1.000    
 1st Qu.: 4.000   1st Qu.:2.000    
 Median : 7.000   Median :4.000    
 Mean   : 6.656   Mean   :3.514    
 3rd Qu.: 9.000   3rd Qu.:4.000    
 Max.   :12.000   Max.   :5.000    
Code
miami_housing %>% 
  ggplot(aes(OCEAN_DIST, sale_price))+
  geom_point() +
  labs(title = "Price Relationship between Final Sale Price and Distance to Ocean", x = "Distance(ft) to Ocean", y = "Sale Price($)")

The graph indicates homes that are closer to the ocean tend to have a higher sale price than homes that are farther away. I notice at the 40,000 and 60,000 mark of distance, there is a general spike in house prices, but I cannot determine what that would be. As stated previously, there are many other factors that can contribute to housing price, but the graph shows distance to ocean is a clear predictor. I will now log this model to correct for the U shaped trend these points take.

Code
#logged
miami_housing %>% 
  ggplot(aes(OCEAN_DIST, log(sale_price)))+
  geom_point() +
  labs(title = "Price Relationship between Final Sale Price and Distance to Ocean", x = "Distance(ft) to Ocean", y = "Sale Price($)")

I want to get a quick summary to understand how age impacts the structure of the home and in turn impacts the price. As I expected, homes with the lowest quality of their structure are on average the oldest and cheapest. For the most part, this trend follows the same for the other home ages and structure quality. Structure 3 is interesting because homes there are the youngest and also the most expensive. I included the distance to ocean variable to see how this relates to the research project and I found structure 3 is also the closest to the ocean, which may indicate a substantial impact on price.

Since newer homes that have the highest quality structure are the second most expensive, it also seems structure quality plays a significant role in the determination of house price when distance to ocean is not valued.

Code
miami_housing %>% 
  group_by(structure_quality) %>% 
  summarise(mean(home_age),  mean(sale_price), mean(OCEAN_DIST))
# A tibble: 5 × 4
  structure_quality `mean(home_age)` `mean(sale_price)` `mean(OCEAN_DIST)`
              <dbl>            <dbl>              <dbl>              <dbl>
1                 1             66.1            162640.             23426.
2                 2             27.6            269672.             25564.
3                 3             15.1           1847250               7103.
4                 4             32.0            382571.             35087.
5                 5             28.8            743189.             32269.
Code
miami_housing %>% 
  group_by(structure_quality)
# A tibble: 13,932 × 17
# Groups:   structure_quality [5]
   latitude longitude   PARCELNO sale_…¹ land_…² floor…³ speci…⁴ RAIL_…⁵ OCEAN…⁶
      <dbl>     <dbl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1     25.9     -80.2    6.22e11  440000    9375    1753       0   2816.  12811.
 2     25.9     -80.2    6.22e11  349000    9375    1715       0   4359.  10648.
 3     25.9     -80.2    6.22e11  800000    9375    2276   49206   4413.  10574.
 4     25.9     -80.2    6.22e11  988000   12450    2058   10033   4585   10156.
 5     25.9     -80.2    6.22e11  755000   12800    1684   16681   4063.  10837.
 6     25.9     -80.2    6.22e11  630000    9900    1531    2978   2391.  13017 
 7     25.9     -80.2    6.22e11 1020000   10387    1753   23116   3277.  11668.
 8     25.9     -80.2    6.22e11  850000   10272    1663   34933   3112.  11718.
 9     25.9     -80.2    6.22e11  250000    9375    1493   11668   2082.  13044.
10     25.9     -80.2    6.22e11 1220000   13803    3077   34580   2938.  11918.
# … with 13,922 more rows, 8 more variables: dist_2_nearest_water <dbl>,
#   dist_2_biz_center <dbl>, dis_2_nearest_subcenter <dbl>, HWY_DIST <dbl>,
#   home_age <dbl>, avno60plus <dbl>, month_sold <dbl>,
#   structure_quality <dbl>, and abbreviated variable names ¹​sale_price,
#   ²​land_sqfoot, ³​floor_sqfoot, ⁴​special_features, ⁵​RAIL_DIST, ⁶​OCEAN_DIST
Code
ggplot(aes(y = home_age, x = structure_quality))+
  geom_boxplot()
Error in `fortify()`:
! `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class uneval.
Did you accidentally pass `aes()` to the `data` argument?
Code
p <- miami_housing %>% 
  ggplot(aes(sale_price))+
  geom_histogram()+
  labs(title = "Distribution of Sale Price($) for Single Family Homes in Miami", x = "Sale Price($)", y = "Count")

library(scales)

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
Code
p + scale_x_continuous(labels = label_comma())
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This visualization depicts the distribution of homes based on their sale price. Most of the prices of homes fall slightly below the 500,000 mark at around 400,000 dollars as I would have to guess. A few homes are well past the 1 million and 2 million dollar mark.

The map view is very cluttered, but it is an interactive map that allows me to click on the data point and look at all the information associated with that house. For example, data points in Key Biscayne range from about $1.6 to $2.6 million. Most of the homes are only ~2000 feet away from the ocean. The case is the same for houses in Miami Beach, Surf Side, and Sunny Isles Beach.

Code
#library(RColorBrewer)
mapview(miami_housing, xcol= "longitude", ycol = "latitude", legend = mapviewGetOption("sale_price"), crs = 4269, grid = FALSE)

Research Question

Does the adage of, location, location, location hold true in Miami? By creating regression models with the variables associated with location of single family homes in Miami, I should be able to see how important location is to house prices.

Hypothesis

ALternative Hypothesis: Location related variables (distance to x) account for an Adjusted R Squared of 0.5 or more, 50% of the variance.

Alternative Hypothesis:

#Analysis

I first want to check for multicollinearity between the variables because if variables I use are correlated with one another, it could skew my results, leading to falsely inflated p values and standard errors. The “dist_2” variables are the ones to watch out for. Discuss more

Code
ggpairs(miami_housing, columns = 8:13)

From the plots above, the distance to the business center and distance to the subcenter (subway) are highly correlated at 0.76. My theory for high correlation is these two locations are in the same area or extremely close two it. Other variables are moderately correlated with one another, but not to the extent where I am worried about interaction.

Hypothesis Testing through Model Comparisons

The model below contains only distance to ocean as a predictor of house price, the variable associated with the hypothesis.

Code
m1 <- lm(sale_price ~ OCEAN_DIST, data = miami_housing)

summary(m1)

Call:
lm(formula = sale_price ~ OCEAN_DIST, data = miami_housing)

Residuals:
    Min      1Q  Median      3Q     Max 
-425896 -185541  -69660   72751 2151101 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.569e+05  5.324e+03  104.60   <2e-16 ***
OCEAN_DIST  -4.952e+00  1.469e-01  -33.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 305000 on 13930 degrees of freedom
Multiple R-squared:  0.07545,   Adjusted R-squared:  0.07538 
F-statistic:  1137 on 1 and 13930 DF,  p-value: < 2.2e-16
Code
m4 <- lm(sale_price ~ OCEAN_DIST + dist_2_nearest_water + dis_2_nearest_subcenter + HWY_DIST, data = miami_housing)

I’ve gotten rid of the distance to business variable since it’s p value is not significant to the analysis. Here I am hesitant to get rid of another variable because they all now contribute to the regression equation with highly significant p values.

Code
m3 <- lm(sale_price ~ RAIL_DIST + OCEAN_DIST + dist_2_nearest_water + dis_2_nearest_subcenter + HWY_DIST, data = miami_housing)
Code
m2 <- lm(sale_price ~ OCEAN_DIST + floor_sqfoot, data = miami_housing)

The m4 model is used to have floor square footage of the home as a control variable in model comparisons.

Code
m3 <- lm(sale_price ~ ., data = miami_housing)


stargazer(m1, m2, m3, type = "text")

===============================================================================================================
                                                          Dependent variable:                                  
                        ---------------------------------------------------------------------------------------
                                                              sale_price                                       
                                    (1)                          (2)                           (3)             
---------------------------------------------------------------------------------------------------------------
latitude                                                                                 -441,794.000***       
                                                                                          (53,131.310)         
                                                                                                               
longitude                                                                               1,416,574.000***       
                                                                                          (113,783.300)        
                                                                                                               
PARCELNO                                                                                   -0.00000***         
                                                                                             (0.000)           
                                                                                                               
land_sqfoot                                                                                 3.475***           
                                                                                             (0.297)           
                                                                                                               
OCEAN_DIST                       -4.952***                    -4.360***                     1.148***           
                                  (0.147)                      (0.108)                       (0.384)           
                                                                                                               
dist_2_nearest_water                                                                        -1.028***          
                                                                                             (0.236)           
                                                                                                               
dist_2_biz_center                                                                           1.044***           
                                                                                             (0.314)           
                                                                                                               
dis_2_nearest_subcenter                                                                     -4.192***          
                                                                                             (0.313)           
                                                                                                               
HWY_DIST                                                                                    3.906***           
                                                                                             (0.280)           
                                                                                                               
home_age                                                                                  -2,055.036***        
                                                                                            (91.291)           
                                                                                                               
avno60plus                                                                               -118,389.700***       
                                                                                          (12,355.280)         
                                                                                                               
month_sold                                                                                   161.240           
                                                                                            (436.454)          
                                                                                                               
structure_quality                                                                         73,073.940***        
                                                                                           (1,668.290)         
                                                                                                               
floor_sqfoot                                                  255.466***                   190.761***          
                                                               (2.331)                       (2.437)           
                                                                                                               
special_features                                                                            2.974***           
                                                                                             (0.125)           
                                                                                                               
RAIL_DIST                                                                                   4.929***           
                                                                                             (0.275)           
                                                                                                               
Constant                       556,875.800***                12,346.350*               125,032,873.000***      
                                (5,323.914)                  (6,316.385)                (10,202,394.000)       
                                                                                                               
---------------------------------------------------------------------------------------------------------------
Observations                       13,932                       13,932                       13,932            
R2                                 0.075                        0.504                         0.714            
Adjusted R2                        0.075                        0.504                         0.714            
Residual Std. Error       305,024.700 (df = 13930)     223,506.700 (df = 13929)     169,785.000 (df = 13915)   
F Statistic             1,136.729*** (df = 1; 13930) 7,066.145*** (df = 2; 13929) 2,169.588*** (df = 16; 13915)
===============================================================================================================
Note:                                                                               *p<0.1; **p<0.05; ***p<0.01

A base regression model with every variable in it shows an adjusted R squared of 0.71. This does not indicate best fit, but it’s valuable to see the regression model with every variable.

Diagnostic Plots

Code
logged_m1 <- lm(log(sale_price) ~ log(OCEAN_DIST), data = miami_housing)

logged_m4 <- lm(log(sale_price) ~ log(OCEAN_DIST) + log(floor_sqfoot), data = miami_housing)

summary(m4)

Call:
lm(formula = sale_price ~ OCEAN_DIST + dist_2_nearest_water + 
    dis_2_nearest_subcenter + HWY_DIST, data = miami_housing)

Residuals:
    Min      1Q  Median      3Q     Max 
-590076 -159555  -54210   74207 2089346 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              5.603e+05  6.564e+03   85.37   <2e-16 ***
OCEAN_DIST              -2.283e+00  1.692e-01  -13.49   <2e-16 ***
dist_2_nearest_water    -3.198e+00  2.519e-01  -12.70   <2e-16 ***
dis_2_nearest_subcenter -3.830e+00  1.205e-01  -31.77   <2e-16 ***
HWY_DIST                 1.394e+01  4.382e-01   31.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 281800 on 13927 degrees of freedom
Multiple R-squared:  0.211, Adjusted R-squared:  0.2108 
F-statistic: 931.4 on 4 and 13927 DF,  p-value: < 2.2e-16
Code
AIC(m4)
[1] 389208.4
Code
AIC(m1)
[1] 391412
Code
AIC(m3)
[1] 375102.6
Code
AIC(m2)
[1] 382748.7
Code
AIC(m5)
Error in AIC(m5): object 'm5' not found
Code
summary(m5)
Error in summary(m5): object 'm5' not found
Code
glimpse(miami_housing)
Rows: 13,932
Columns: 17
$ latitude                <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.891…
$ longitude               <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -8…
$ PARCELNO                <dbl> 622280070620, 622280100460, 622280100470, 6222…
$ sale_price              <dbl> 440000, 349000, 800000, 988000, 755000, 630000…
$ land_sqfoot             <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 1…
$ floor_sqfoot            <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663…
$ special_features        <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933,…
$ RAIL_DIST               <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4…
$ OCEAN_DIST              <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 1…
$ dist_2_nearest_water    <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 1…
$ dist_2_biz_center       <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 4…
$ dis_2_nearest_subcenter <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 3…
$ HWY_DIST                <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 1…
$ home_age                <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51…
$ avno60plus              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ month_sold              <dbl> 8, 9, 2, 9, 7, 2, 2, 9, 3, 11, 2, 11, 7, 7, 9,…
$ structure_quality       <dbl> 4, 4, 4, 4, 4, 4, 5, 4, 4, 5, 4, 2, 2, 2, 5, 2…

Unlogged m1 model

Code
par(mfrow = c(2,3)); plot(m4, which = 1:6)

The residuals vs fitted plot indicates some heteroskedasticity with the fanning out of the residuals as the values increase. As the fitted values increase, so does the variance of the residuals which is a problem because all the values should have equal variance across the plot. I can fix this with a transformation of the dependent variable.

Normal QQ is nonlinear demonstrating my data has some extreme values.

The Cook’s distance graph indicates 2 outliers.

Logged m1 model

Code
par(mfrow = c(2,3)); plot(logged_m4, which = 1:6)

In the second diagnostic plot, I logged the sale price variable to hopefully correct the assumptions of linearity since a few appear to be violated from the unlogged plots. ## Results/Conclusion

Home distance to the ocean does not have as significant an impact on home as initially thought in this analysis. Based off the first plot of sale price and distance to ocean, I